Probabilistic ParaMor

Authors

  • Christian Monson
  • Kristy Hollingshead
  • Brian Roark
Abstract

The ParaMor algorithm for unsupervised morphology induction, which competed in the 2007 and 2008 Morpho Challenge competitions, does not assign a numeric score to its segmentation decisions. Scoring each character boundary in each word with the likelihood that it falls at a true morpheme boundary would allow ParaMor to adjust the confidence level at which the algorithm proposes segmentations. A sliding threshold on segmentation confidence would, in turn, permit a trade-off between precision and recall that could optimize F1 or other metrics of interest. Our submission to Morpho Challenge 2009 enriches ParaMor with segmentation confidences by training an off-the-shelf statistical natural language tagger to mimic ParaMor’s morphological segmentations. For a given word, the tagger’s probabilistic confidence that ParaMor would propose the character, c, as the first character of a new morpheme serves as the numeric score of the candidate morpheme boundary that immediately precedes c. We have trained a ParaMor tagger mimic over a development data set of 500,000 unique Hungarian word types. By adjusting the threshold above which the ParaMor mimic proposes morpheme boundaries, we improve ParaMor’s F1 score for Hungarian by 5.9% absolute, from 41.4% to 47.3%. Moreover, by training a probabilistic tagger to emulate the segmentations of a second unsupervised morphology induction system, Morfessor, we are able to combine ParaMor’s segmentation decisions with Morfessor’s to form a single joint segmentation of each word. Our joint ParaMor-Morfessor tagger mimic enhances F1 performance on our Hungarian development set by a further 3.4% absolute, ultimately achieving an F1 score of 50.7%.
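The sliding threshold described in the abstract can be illustrated with a minimal Python sketch. It assumes each character of a word already carries a score, namely the tagger's confidence that the character begins a new morpheme. The function names, the example word, and the scores are all hypothetical and are not taken from the paper. Scoring by boundary position is a simplification chosen for brevity and is not necessarily the evaluation used in Morpho Challenge.

```python
def predicted_boundaries(boundary_probs, threshold):
    """Return positions i where a morpheme boundary is proposed before character i.

    boundary_probs[i] is an assumed tagger confidence that character i starts a
    new morpheme. Position 0 is skipped: a word always starts a morpheme there.
    """
    return {i for i in range(1, len(boundary_probs)) if boundary_probs[i] >= threshold}


def segment(word, boundary_probs, threshold):
    """Split a word at every boundary whose score meets the threshold."""
    if len(word) != len(boundary_probs):
        raise ValueError("expected one score per character")
    cuts = sorted(predicted_boundaries(boundary_probs, threshold))
    starts = [0] + cuts
    ends = cuts + [len(word)]
    return [word[s:e] for s, e in zip(starts, ends)]


def boundary_prf(predicted, gold):
    """Precision, recall, and F1 over sets of boundary positions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up scores for illustration only; index 4 ('e') is the true boundary.
    word = "walked"
    scores = [0.0, 0.10, 0.05, 0.20, 0.90, 0.30]
    gold = {4}
    for threshold in (0.25, 0.50, 0.95):
        pieces = segment(word, scores, threshold)
        p, r, f1 = boundary_prf(predicted_boundaries(scores, threshold), gold)
        print(threshold, pieces, round(p, 2), round(r, 2), round(f1, 2))
```

Lowering the threshold proposes more boundaries, which tends to raise recall and lower precision. Raising it does the opposite. Sweeping the threshold over a development set and keeping the value with the best F1 is one way to realize the trade-off the abstract describes.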

Similar resources

ParaMor: Finding Paradigms across Morphology

ParaMor automatically learns morphological paradigms from unlabelled text, and uses them to annotate word forms with morpheme boundaries. ParaMor competed in the English and German tracks of Morpho Challenge 2007 (Kurimo et al., 2008). In English, ParaMor’s balanced precision and recall outperform at F1 an already sophisticated baseline induction algorithm, Morfessor (Creutz, 2006). In German, ...

ParaMor: Finding Paradigms across Morphology

Our algorithm, ParaMor, fared well in Morpho Challenge 2007 (Kurimo et al., 2007), a peer-operated competition pitting against one another algorithms designed to discover the morphological structure of natural languages from nothing more than raw text. ParaMor constructs sets of affixes closely mimicking the paradigms of a language, and, with these structures in hand, annotates word forms with ...

Evaluating an Agglutinative Segmentation Model for ParaMor

This paper describes and evaluates a modification to the segmentation model used in the unsupervised morphology induction system, ParaMor. Our improved segmentation model permits multiple morpheme boundaries in a single word. To prepare ParaMor to effectively apply the new agglutinative segmentation model, two heuristics improve ParaMor’s precision. These precision-enhancing heuristics are adap...

ParaMor and Morpho Challenge 2008

ParaMor, our unsupervised morphology induction system, performed well at Morpho Challenge 2008. When ParaMor's morphological analyses, which specialize in identifying inflectional morphology, are added to the analyses from the general-purpose unsupervised morphology induction system, Morfessor, the combined system identifies the morphemes of all five Challenge languages at recall scores higher t...

ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis

Paradigms provide an inherent organizational structure to natural language morphology. ParaMor, our minimally supervised morphology induction algorithm, retrusses the word forms of raw text corpora back onto their paradigmatic skeletons, performing on par with state-of-the-art minimally supervised morphology induction algorithms at morphological analysis of English and German. ParaMor consists o...

Publication date: 2009